GTMNet: a vision transformer with guided transmission map for single remote sensing image dehazing

Existing dehazing algorithms are not effective for remote sensing images (RSIs) with dense haze, and their dehazed results are prone to over-enhancement, color distortion, and artifacts. To tackle these problems, we propose GTMNet, a model based on convolutional neural networks (CNNs) and vision transformers (ViTs) that incorporates the dark channel prior (DCP). Specifically, a spatial feature transform (SFT) layer is first used to smoothly introduce the guided transmission map (GTM) into the model, improving the ability of the network to estimate haze thickness. A strengthen-operate-subtract (SOS) boosted module is then added to refine the local features of the restored image. The framework of GTMNet is determined by adjusting the input of the SOS boosted module and the position of the SFT layer. On the SateHaze1k dataset, we compare GTMNet with several classical dehazing algorithms. The results show that on the Moderate Fog and Thick Fog sub-datasets, the PSNR and SSIM of GTMNet-B are comparable to those of the state-of-the-art model Dehazeformer-L, with only 0.1 times the number of parameters. In addition, our method visibly improves the clarity and details of dehazed images, which supports the usefulness of the prior GTM and the SOS boosted module in single RSI dehazing.

algorithms directly predict the latent haze-free images in an end-to-end manner. Huang et al. 7 proposed a conditional generative adversarial network that uses RGB and SAR images for dehazing. Mehta et al. 8 developed SkyGAN specifically for removing haze in aerial images, addressing the challenge of limited hazy hyperspectral aerial image datasets.
In recent years, the Vision Transformer (ViT) 9 has excelled in high-level vision tasks, focusing on modeling long-term dependencies in data. However, the early ViT and the Pyramid Vision Transformer (PVT) 10 were over-parameterized and computationally expensive. Inspired by Swin-Transformer 12, Liang et al. 11 therefore proposed SwinIR, which consists of several Residual Swin Transformer Blocks (RSTB), each with several Swin Transformer layers and a residual connection. Uformer 13 introduced a novel locally-enhanced window (LeWin) Transformer block and a learnable multi-scale restoration modulator in the form of a multi-scale spatial bias to adjust features in multiple layers of the Uformer decoder. Dong et al. 14 proposed TransRA, a two-branch neural network fused with transformer and residual attention, to recover fine details of dehazed RSIs. Song et al. 15 proposed Dehazeformer based on Swin-Transformer 12 and U-Net 16, modifying the normalization layer, activation function, and spatial information aggregation scheme, and introducing soft constraints using a weak prior. Dehazeformer has shown superior performance compared to previous methods on the SOTS indoor dataset, while being more efficient with fewer parameters and lower computational costs. However, it is difficult to obtain sufficient paired hazy RSI datasets due to natural conditions and equipment limitations. When the training set is small and contains dense-haze images, Dehazeformer performs poorly in RSI dehazing.
To sum up, in RSI dehazing tasks, both local and global features are important, and traditional image dehazing methods rest on sound theoretical foundations that can guide network learning. Thus, we design a new RGB remote sensing image dehazing model (GTMNet) based on Dehazeformer by reconstructing the model architecture and incorporating the DCP into the proposed network. Due to the down-sampling operations in the encoder of Dehazeformer, the compressed spatial information may not be effectively retrieved by its decoder. Therefore, we use the strengthen-operate-subtract (SOS) strategy in the decoder to retrieve more compressed information and gradually restore latent haze-free images. We also compare several advanced dehazing models with GTMNet and verify the applicability of the proposed model. The main contributions of this paper are as follows: (1) A novel hybrid architecture is proposed, which is based on CNN and ViT and incorporates the DCP. Compared with the other referenced models, it provides better PSNR and SSIM; (2) The transmission map optimized by guided filtering and a linear transformation is smoothly introduced into the model through the spatial feature transform (SFT) layer, enabling better estimation of the haze thickness in the image and thus improving performance; (3) To gradually refine the restored image in the feature recovery module, the SOS boosted module is incorporated into the image dehazing task via a skip connection.

Proposed method
This section presents the details of GTMNet. First, we introduce the DCP. Then we estimate the transmission map. Finally, we describe the SFT layer, the SOS boosted module, and the SK fusion module.
Dark channel prior. He et al. 4 conducted statistical analysis on non-sky regions of more than 5,000 haze-free outdoor images, and found that there are often some pixels with very low values in at least one color channel. Formally, the dark primary color of the haze-free image J(x) is defined as:

$$J^{dark}(x) = \min_{y \in \Omega(x)} \left( \min_{c \in \{R, G, B\}} J^{c}(y) \right) \tag{2}$$

where $c$ represents one of the R, G, and B channels; $\Omega(x)$ is a local square patch centered at $x$; and $J^{c}$ represents a color channel of $J$. The observation shows that, if $J$ is a haze-free outdoor image, then except for the sky region, the pixel values of $J^{dark}$ tend to 0. This statistical observation is called the DCP, or the dark primary color prior.
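As an illustration of the dark channel definition, the following minimal NumPy sketch takes the per-pixel minimum over the color channels and then the minimum over a square patch. The function name `dark_channel`, the 15 × 15 default patch, and the edge padding are assumptions made for this example; the text does not specify them.

```python
import numpy as np

def dark_channel(image: np.ndarray, patch: int = 15) -> np.ndarray:
    """Dark channel of an H x W x 3 image with values in [0, 1].

    `patch` is the (odd) side length of the square window Omega(x);
    the default of 15 is an assumption for this sketch.
    """
    # Minimum over the color channels at every pixel.
    min_rgb = image.min(axis=2)
    # Minimum over a patch x patch window centered on each pixel.
    r = patch // 2
    padded = np.pad(min_rgb, r, mode="edge")
    h, w = min_rgb.shape
    dark = np.full((h, w), np.inf)
    for dy in range(patch):
        for dx in range(patch):
            dark = np.minimum(dark, padded[dy:dy + h, dx:dx + w])
    return dark
```

For a haze-free outdoor image, most non-sky values returned by such a function would be expected to lie close to 0, which is the observation the prior rests on.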

Estimation of transmission map.
To obtain a clear haze-free image J in Eq. (1), it is necessary to solve for A and t. Equation (1) can be rewritten as:

$$J(x) = \frac{I(x) - A}{t(x)} + A \tag{3}$$

According to the DCP, the dark channel of a hazy image approximates the haze density well. Therefore, He et al. 4 picked the top 0.1% brightest pixels in the dark channel of the hazy image. Among these pixels, the pixel with the highest intensity in the input image I is selected as the atmospheric light.
Assuming that the transmission in a local patch Ω(x) is constant, the patch's transmission t(x) can be defined as:

$$t(x) = 1 - \min_{y \in \Omega(x)} \left( \min_{c} \frac{I^{c}(y)}{A^{c}} \right) \tag{4}$$

www.nature.com/scientificreports/

As mentioned in the literature 4, even if the weather is clear, distant objects are more or less affected by haze, so the authors retain a small amount of haze by introducing a factor ω ∈ [0, 1] to preserve a sense of depth. The specific expression is:

$$t(x) = 1 - \omega \min_{y \in \Omega(x)} \left( \min_{c} \frac{I^{c}(y)}{A^{c}} \right) \tag{5}$$

where ω is usually taken as 0.95.
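The atmospheric light selection and the transmission estimate with the factor ω can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the helper names, the 15 × 15 patch, and the use of the channel sum as the "intensity" of a pixel are assumptions.

```python
import numpy as np

def min_filter(channel: np.ndarray, patch: int) -> np.ndarray:
    """Minimum over a patch x patch window (odd patch, edge-padded)."""
    r = patch // 2
    padded = np.pad(channel, r, mode="edge")
    h, w = channel.shape
    out = np.full((h, w), np.inf)
    for dy in range(patch):
        for dx in range(patch):
            out = np.minimum(out, padded[dy:dy + h, dx:dx + w])
    return out

def estimate_atmospheric_light(hazy, patch=15, top=0.001):
    """Pick A from the top 0.1% brightest dark-channel pixels."""
    dark = min_filter(hazy.min(axis=2), patch)
    n = max(1, int(dark.size * top))
    # Indices of the n brightest pixels in the dark channel.
    idx = np.argsort(dark.ravel())[-n:]
    candidates = hazy.reshape(-1, 3)[idx]
    # Among them, keep the pixel with the highest intensity in the input
    # (intensity measured as the channel sum, an assumption).
    return candidates[candidates.sum(axis=1).argmax()]

def estimate_transmission(hazy, A, patch=15, omega=0.95):
    """t(x) = 1 - omega * dark channel of the image normalized by A."""
    normalized = hazy / np.maximum(A, 1e-6)
    return 1.0 - omega * min_filter(normalized.min(axis=2), patch)
```

With ω = 0.95, a region whose normalized dark channel equals 1 (fully hazy) still receives a transmission of about 0.05 instead of 0.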
Due to the local-constancy assumption, the estimated transmission map t(x) exhibits block effects. In traditional image dehazing methods, t(x) is usually refined using soft matting, guided filtering, or fast guided filtering. Although soft matting can achieve good results, the edge information of objects is weak and the method is time-consuming. Therefore, we use a fast guided filter 17 for optimization, with the filter window radius set to 60 and the regularization parameter ε set to 0.0001. Figure 1 shows the resulting transmission maps on the SateHaze1k dataset. We find that the transmission map optimized by the fast guided filter in Fig. 1c can objectively estimate the haze distribution of the input image. However, the purpose of introducing the DCP in this paper is to estimate the haze concentration. As shown in Fig. 1d, to highlight the haze thickness in the image, we apply a linear transformation to the optimized transmission map t and define the result as the guided transmission map (GTM) t1.

GTMNet. As shown in Fig. 2 and Table 1, the proposed network GTMNet is based on Dehazeformer, but incorporates SFT layers 18 and SOS boosted modules. SFT layers integrate the GTM into GTMNet, fusing the features of the GTM and the input image to estimate the haze thickness in the input image more accurately. SOS boosted modules restore clear images iteratively. At the end of the decoder, a soft reconstruction layer is used to estimate the haze-free image J.
SFT layer. The SFT layer was first applied in super-resolution tasks 18. It is parameter-efficient and can be easily introduced into existing dehazing network structures with strong extensibility. As shown in Fig. 3, we use the GTM t1 as the additional input of the SFT layer, which first applies three convolutional layers to extract the conditional maps φ from the GTM; the conditional maps φ are then fed to the other two convolutional layers to predict the modulation parameters γ and β, respectively; finally, the transformation is carried out by scaling and shifting the feature maps of a specific layer, and the output shifted features are obtained by:

$$\mathrm{SFT}(F \mid \gamma, \beta) = \gamma \odot F \oplus \beta \tag{7}$$

where F denotes the feature maps, with the same dimensions as γ and β, ⊙ denotes element-wise multiplication, i.e., the Hadamard product, and ⊕ denotes element-wise addition. Since the spatial dimensions are preserved, the SFT layer performs feature-wise manipulation and spatial-wise transformation. Since objects in RSIs are generally tiny, obtaining local features becomes crucial. In this paper, we use SFT layers with shared parameters to compensate for the Transformer's limited ability to acquire local features.
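The scale-and-shift modulation of the SFT layer can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the conditional maps φ are taken as given, single 1 × 1 convolutions stand in for the layers that predict γ and β, and all function names and array shapes are hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a C_in x H x W array; weight is C_out x C_in."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def sft_layer(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply gamma * F + beta, with gamma and beta predicted from cond.

    features: C x H x W feature maps F.
    cond:     C_phi x H x W conditional maps extracted from the GTM.
    """
    gamma = conv1x1(cond, w_gamma, b_gamma)  # scaling, same shape as F
    beta = conv1x1(cond, w_beta, b_beta)     # shifting, same shape as F
    return gamma * features + beta
```

Because γ and β have one value per channel and per spatial position, the modulation can differ between hazy and clear regions of the same feature map.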
SOS boosted module. The SOS boosting method 19 has been mathematically proven to be effective for image denoising, where it restores clear images iteratively. Dong et al. 20 evaluated a variety of optional SOS boosted modules, and the results show that the following boosting scheme performs best, as shown in Eq. (8):

$$S^{n} = G^{n}_{\theta_n}\left(I^{n} + \mathrm{Up}(S^{n+1})\right) - \mathrm{Up}(S^{n+1}) \tag{8}$$

where Up(·) denotes the upsampling operator using a pixel shuffle method 21, $S^{n+1}$ represents the previous-level feature, $I^{n}$ denotes the latent feature from the encoder, $(I^{n} + \mathrm{Up}(S^{n+1}))$ represents the strengthened feature, and $G^{n}_{\theta_n}$ denotes the trainable refinement unit at the $n$-th level parameterized by $\theta_n$. According to the proposed architecture, Eq. (8) is written as Eq. (9):

$$S^{n} = G^{n}_{\theta_n}\left(I^{n} + \mathrm{Up}(J^{n+1})\right) - \mathrm{Up}(J^{n+1}) \tag{9}$$

where $J^{n+1}$ denotes the feature from the Dehazeformer block of the decoder. The SOS boosted module consists of three residual blocks, as shown in Fig. 4.
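One strengthen-operate-subtract step can be sketched in NumPy as follows. The sketch assumes that the deeper feature already has four times as many channels as the encoder feature, so that pixel shuffle alone yields a matching shape, and it takes the refinement unit as an arbitrary callable `refine`; both are assumptions, as are the function names.

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange a (C*r*r) x H x W array into C x (H*r) x (W*r)."""
    c, h, w = x.shape[0] // (r * r), x.shape[1], x.shape[2]
    x = x.reshape(c, r, r, h, w)
    # Interleave the r x r sub-pixel offsets with the spatial axes.
    return x.transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

def sos_boost(encoder_feat, deeper_feat, refine):
    """S = G(I + Up(J)) - Up(J) for one decoder level.

    encoder_feat: C x 2H x 2W latent feature I from the encoder.
    deeper_feat:  4C x H x W feature J from the deeper decoder level.
    refine:       callable standing in for the refinement unit G.
    """
    up = pixel_shuffle(deeper_feat, r=2)
    strengthened = encoder_feat + up      # strengthen
    return refine(strengthened) - up      # operate, then subtract
```

If `refine` is the identity, the step returns the encoder feature unchanged, so any improvement comes from what the refinement unit learns on the strengthened feature.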
SK fusion module. Song et al. 22 designed a selective kernel (SK) Fusion module, inspired by SKNet 23, to fuse multiple branches using channel attention. We use the SK Fusion module 22 to fuse the SOS and decoder branches. Specifically, given two feature maps $x_1$ and $x_2$, a linear layer f(·) is first used to project $x_1$ to $\hat{x}_1$. Then a global average pooling GAP(·), a multilayer perceptron MLP(·), a softmax function, and a split operation are used to obtain the fusion weights, as shown in Eq. (10):

$$\{a_1, a_2\} = \mathrm{Split}\left(\mathrm{Softmax}\left(\mathrm{MLP}\left(\mathrm{GAP}\left(\hat{x}_1 + x_2\right)\right)\right)\right) \tag{10}$$

Finally, the weights $\{a_1, a_2\}$ are used to fuse $\hat{x}_1$ and $x_2$ with an additional short residual via $y = a_1 \hat{x}_1 + a_2 x_2 + x_2$.
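The fusion rule can be sketched in NumPy as below. This is a minimal sketch under assumptions: the linear layer f(·) is represented by a C × C matrix applied per pixel, the MLP is passed in as a callable that maps C pooled values to 2C logits, and the softmax is taken across the two branches for each channel. The names are hypothetical.

```python
import numpy as np

def sk_fusion(x1, x2, proj, mlp):
    """Fuse two C x H x W feature maps with channel-attention weights.

    proj: C x C matrix standing in for the linear layer f(.).
    mlp:  callable mapping a length-C vector to 2C logits.
    """
    x1_hat = np.einsum("oc,chw->ohw", proj, x1)      # project x1
    pooled = (x1_hat + x2).mean(axis=(1, 2))         # GAP over H and W
    logits = mlp(pooled).reshape(2, -1)              # split into 2 x C
    logits = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(logits)
    weights = exp / exp.sum(axis=0, keepdims=True)   # softmax over branches
    a1 = weights[0][:, None, None]
    a2 = weights[1][:, None, None]
    # Weighted sum plus the short residual on x2.
    return a1 * x1_hat + a2 * x2 + x2
```

When the MLP outputs equal logits for both branches, each weight is 0.5 and the output reduces to 0.5 x̂1 + 1.5 x2.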

Experiments
In this part, we first present the datasets and the implementation details of GTMNet. Then, we evaluate our method on the RS-Haze and SateHaze1k datasets. Finally, ablation studies and other comparative experiments are conducted to analyze the proposed approach.

Datasets. RS-Haze 22 is a synthetic hazy RSI dataset synthesized from 76 RSIs containing diverse topography with good weather conditions and 108 cloudy RSIs. All the images are downloaded from the Landsat-8 Level 1 data product on EarthExplorer. The final training set contains 51,300 RSI pairs, and the test set contains 2,700 RSI pairs with an image resolution of 512 × 512. Since the proposed method builds on the Dehazeformer model, the experimental setup is consistent with that of Dehazeformer 22. We train the model using L1 loss for 150 epochs, validating once per epoch. The validation set is the same as the test set.

SateHaze1k 7 is also a synthetic hazy satellite remote sensing dataset, which uses Photoshop software as an auxiliary tool to generate rich, realistic, and diverse hazy images. This dataset contains 1,200 RSI pairs, and each pair includes a hazy image and a real haze-free image. These images are divided into three hazy image subsets: Thin Fog, Moderate Fog, and Thick Fog, with an image resolution of 512 × 512. We select 320 pairs of images from each subset as the training set and 45 pairs as the test set. Each subset is trained and tested separately. Since the SateHaze1k dataset is small, we train GTMNet for 1000 epochs and validate it every ten epochs. Other experimental configurations are the same as those for the RS-Haze dataset.

Implementation details. We provide four variants of GTMNet (-T, -S, -B and -L for tiny, small, basic, and large, respectively), implement the proposed network structure using the PyTorch framework, and train the model on an NVIDIA GeForce RTX3090. During training, images are randomly cropped to 256 × 256 patches. We set different mini-batch sizes for different variants, i.e., {32, 16, 8, 4} for {-T, -S, -B, -L}. The initial learning rate is set to $\{4, 2, 2, 1\} \times 10^{-4}$ for the variants {-T, -S, -B, -L}. We use the AdamW optimizer 24 with a cosine annealing strategy 25 to train the model, where the learning rate gradually decreases from the initial learning rate to $\{4, 2, 2, 1\} \times 10^{-6}$.
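For reference, the standard cosine annealing formula can be written as a short function. The sketch below assumes the learning rate is updated once per epoch and uses the -T values quoted in the text (4 × 10⁻⁴ decaying to 4 × 10⁻⁶ over 1000 epochs); the function name and the per-epoch granularity are assumptions, since the text does not say whether the schedule steps per epoch or per iteration.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min):
    """Learning rate at `epoch` on a single cosine decay from lr_max to lr_min."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos_term

# Example with the -T settings from the text (per-epoch stepping assumed).
schedule = [cosine_annealing_lr(e, 1000, 4e-4, 4e-6) for e in (0, 250, 500, 750, 1000)]
```

The schedule starts at the initial learning rate, passes through the midpoint of the two rates halfway through training, and ends at the minimum rate.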
The proposed mechanism for GTMNet training is illustrated in Algorithm 1. All the learnable parameters in GTMNet are initialized using the truncated normal distribution strategy 26.

The quantitative results are reported in Tables 2 and 3, where bold indicates the best value and underline indicates the second-best value.

RS-Haze dataset.
Owing to equipment limitations, we train and test only the -T variant on this dataset. We compare the proposed method with four other classical dehazing algorithms. As shown in Table 2, the PSNR of our method is slightly lower than that of Dehazeformer-T, while the SSIM of both is the same. We attribute this to the larger number of parameters in the proposed architecture, which makes it more prone to overfitting and weakens its generalization performance.
SateHaze1k dataset. We compare the proposed method with DCP 4, DehazeNet 5, Huang (SAR) 7, SkyGAN 8, TransRA 14 and Dehazeformer 22, and the results are shown in Table 3. The PSNR and SSIM of GTMNet-T on the three sub-datasets are better than those of Dehazeformer-T 22; in particular, the PSNR on Thin Fog improves by nearly 2.6%, and the SSIM increases from 0.968 to 0.970. On Moderate Fog, the PSNR and SSIM of GTMNet-[…]. Compared with Huang (SAR) 7 and SkyGAN 8, the SSIM improves by 8.7% and 5.2%, respectively. On the three sub-datasets, GTMNet-T achieves better PSNR and SSIM scores than TransRA 14, with a significant improvement in PSNR. As Table 3 and the quantitative comparison above show, the proposed model remains lightweight, although the number of parameters has increased slightly. On the Moderate Fog and Thick Fog sub-datasets, GTMNet-B performs comparably to Dehazeformer-L, but with only 0.1 times the number of parameters. However, the performance of GTMNet-L is inferior to that of Dehazeformer-L, which may have two causes: first, the larger number of parameters of GTMNet-L makes it more prone to overfitting; second, the small dataset reduces the generalization ability of GTMNet-L.
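As a reminder of how the PSNR values in these comparisons are defined, the standard formula can be sketched in NumPy as follows. The function name and the assumption that images are scaled to [0, 1] are choices made for this example; the evaluation code used for the tables is not described in the text.

```python
import numpy as np

def psnr(restored, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = restored.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic, a gain of a few percent in dB corresponds to a larger relative reduction in mean squared error.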
Qualitative evaluation. A qualitative comparison of related methods was performed on the RS-Haze and SateHaze1k datasets. Since Song et al. 22 have already compared the existing advanced image dehazing methods on the RS-Haze dataset, we only present the dehazed images of GTMNet-T and Dehazeformer-T here. As shown in Fig. 5, there is little visual difference between GTMNet-T and Dehazeformer-T on the RS-Haze images, both showing clarity, rich feature information, realistic colors, and a sense of hierarchy.
On SateHaze1k dataset, we present the qualitative comparison results of the GTMNet and state-of-the-art methods. The hazy input images include farmland, roads, buildings and vegetation, as shown in Fig. 6. We found that the DCP 4 method failed, possibly due to the similarity between the colors of the atmospheric light and the object. Although the method of Huang (SAR) 7 can remove haze, the ground feature information of the restored image in the dense haze area is not rich enough, and the building details are severely weakened. In general, both DehazeNet 5 and SkyGAN 8 failed to completely remove the haze (as shown in the processing result of the first hazy image in Fig. 6), resulting in unnatural color of the image and weak recovery ability for detailed information. Dehazeformer-T 22 and GTMNet-T solve the problem of incomplete image dehazing. However, for areas with thick haze or cloud haze, the Dehazeformer algorithm suffers from serious color distortion. GTMNet improves not only the problem of image color deviation but also the sharpness.
Ablation study. In this part, we perform ablation studies on the proposed model structure to analyze the factors that may influence the results. In each group of experiments, all settings are identical except for the factor under study.

The effects of different components on the model performance.
To study the influence of different components on the image dehazing effect, we take Dehazeformer-T 22 as the baseline model and conduct ablation experiments on different components on SateHaze1k dataset 7 .
As listed in Table 4, D-SOS-T refers to adding the SOS module to Dehazeformer-T. According to Table 5, its PSNR and SSIM on the three sub-datasets improve significantly, verifying the effectiveness of the SOS module in the image dehazing task. D-GTM-T indicates the introduction of the GTM as a prior into Dehazeformer-T through two SFT layers. The location of the SFT layer is shown in Fig. 9b. According to Table 5, adding only the prior GTM to Dehazeformer-T without the SOS boosted strategy performs better than Dehazeformer-T on Moderate Fog, but worse on Thin Fog and Thick Fog. We believe this is because the method for obtaining the GTM is based on statistics of ordinary images, and there is a large gap between RSIs and ordinary images. Traditional prior methods are more effective on images with uniform haze.
As shown in Fig. 7, the haze-free images generated by Dehazeformer-T, D-SOS-T, and D-GTM-T all show building distortion. Among all the methods, GTMNet achieves the best dehazing effect, ensuring the clarity of the restored image and better restoring its color. On the Thin Fog and Thick Fog sub-datasets, PSNR and SSIM increase more when the two components are used together than when they are used separately.
The effects of different inputs of SOS1 module on the model performance. According to Eqs. (8) and (9), we designed two ablation models, D-SOS-T and D-SOS1-T, on the SateHaze1k dataset. The specific configuration is shown in Table 6. According to Table 7, if $S^{2}$ is directly upsampled and input to SOS1 (Fig. 2), then compared with D-SOS-T, PSNR decreases from 27.09 to 26.77 dB, while SSIM remains unchanged on Moderate Fog. In addition, compared with Dehazeformer-T, PSNR and SSIM increase from 26.38 dB and 0.969 to 26.77 dB and 0.971, respectively. As seen in Fig. 8, there is very little visual difference between the dehazed images of D-SOS-T and D-SOS1-T. In the dense haze area, the color distortion is severe and the edge detail is lost, as shown in the results for the third hazy image in Fig. 8. To sum up, $\mathrm{Up}(J^{2})$ is set as the input of the SOS1 module.
Table 4. Ablation models of different components (columns: Models, SOS, Number of SFT layers, GTM; the baseline Dehazeformer-T uses 0 SFT layers).

The effects of SFT layer and GTM on the model performance. According to the structure of the model, the position of SFT layers can be categorized into four situations (as shown in Fig. 9): (a) using only one SFT layer in front of Dehazeformer block1, (b) using only one SFT layer behind Dehazeformer block5, (c) using one SFT layer in front of Dehazeformer block1 and another behind Dehazeformer block5 (i.e., GTMNet), and (d) using one SFT layer in front of Dehazeformer block2 and another behind Dehazeformer block4. As shown in Table 8, (d)-T has the highest PSNR and SSIM on Moderate Fog, but Table 9 indicates that GTMNet-B has a greater increase in PSNR and SSIM than (d)-B. Moreover, as seen from the comparison results in Fig. 10, the best dehazed result is achieved using GTMNet-T, with significantly improved image clarity and less severe image color distortion, especially in the third hazy image in Fig. 10.
Based on the results shown in Table 8, we conclude that adding GTM to both the encoder and decoder has a superior effect on removing haze from the Thin Fog RSIs, and adding GTM solely to the decoder has a better effect on removing haze from the Moderate Fog and Thick Fog RSIs. We believe that the effectiveness of GTM is not only related to the thickness of haze, but also depends on the presence or absence of SOS boosted modules.

Different transmission maps can impact the dehazing performance of a model. In our experiment, we utilized two types of transmission maps: the transmission map optimized solely by guided filtering, named (c)-t-T, and the GTM obtained by optimizing the estimated transmission map via guided filtering and subsequently applying a linear transformation to it, which was used in GTMNet. As shown in Table 8, the GTM leads to higher PSNR and SSIM on both Thin Fog and Thick Fog compared to the transmission map optimized solely by guided filtering. Moreover, the subjective visual evaluation and objective quantitative metrics demonstrate that GTM is also suitable for local dense haze images and yields a remarkable dehazing effect.

Figure 7. Qualitative comparison of different components ablation models on SateHaze1k dataset.

Table 6. Ablation models of different inputs to the SOS1 module (columns: Models, SOS, SOS1 inputs).
Table 8. Quantitative comparison of ablation models of SFT layer and GTM on SateHaze1k dataset. Bold indicates the best value and underline indicates the second-best value.

The effects of initial learning rate on the model performance. According to the training method in Dehazeformer 22, the initial learning rate of the model decreases as the batch size decreases. Following the linear scaling rule, the initial learning rate of GTMNet-B should be $1 \times 10^{-4}$. We performed ablation experiments on the three sub-datasets and found that reducing the initial learning rate of GTMNet-B generally lowers PSNR and SSIM significantly, as shown in Table 10. We therefore kept the initial learning rate at $2 \times 10^{-4}$, even though the batch size of -B is smaller.
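The arithmetic behind the linear scaling rule can be checked with a few lines of Python. The sketch takes the -T setting from the implementation details (batch size 32, initial learning rate 4 × 10⁻⁴) as the reference point, which is an assumption about how the rule is anchored; the function name is hypothetical.

```python
def linear_scaled_lr(base_lr, base_batch, batch):
    """Scale the learning rate in proportion to the batch size."""
    return base_lr * batch / base_batch

# Reference point: -T with batch size 32 and learning rate 4e-4.
for name, batch in (("-T", 32), ("-S", 16), ("-B", 8)):
    print(name, linear_scaled_lr(4e-4, 32, batch))
```

Under this anchoring, the rule gives 2 × 10⁻⁴ for -S and 1 × 10⁻⁴ for -B, which matches the value the text says the rule would prescribe for GTMNet-B before the ablation led to keeping 2 × 10⁻⁴.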
Quantitative comparison of real-world images. To evaluate the generalization ability of GTMNet, we select two real-world hazy RSIs captured by unmanned aerial vehicles for testing. Since Dehazeformer is overall the second-best method, we only compare the results of GTMNet-T and Dehazeformer-T in this part, using the -T models trained on Moderate Fog to test the two real-world hazy images. Figure 11 shows little visual difference between the results obtained by the proposed algorithm and Dehazeformer-T. Both methods produce clear images with rich ground information and realistic colors, suggesting that both algorithms are suitable for real-world hazy remote sensing images. Additional visual comparisons on real-world images are included in the Supplementary Material.
The impact of dehazing results on subsequent tasks. Hazy images suffer from problems like low contrast, low saturation, detail loss, and color deviation, which seriously affect image analysis tasks, such as classification, positioning, detection, and segmentation. Therefore, in such cases, dehazing is crucial for generating images with good perceptual quality and improving the performance of subsequent computer vision tasks.
In this section, we analyze the impact of dehazing results on RSI water body segmentation. Firstly, we trained an RSI water segmentation network inspired by the U-Net for biomedical image segmentation 28 using 1500 RSIs and tested it using 300 RSIs. Secondly, we selected two images from the test set, added a moderate concentration of haze using Photoshop software, and tested the two images using the -T model trained on Moderate Fog. Finally, we qualitatively compare the results of water body segmentation for hazy inputs, dehazing results from GTMNet-T and Dehazeformer-T, and haze-free images. As shown in Fig. 12, there is very little visual difference between the dehazed images of GTMNet-T and haze-free images. However, the dehazed images of Dehazeformer-T have increased errors in the water body segmentation process compared to haze-free images.

Conclusions
Combining the advantages of ViT and CNN, we propose a new RSI dehazing hybrid model GTMNet. The GTM is first introduced into the model using two SFT layers to improve the model's ability to estimate the haze thickness. The SOS boosted module is then introduced to refine the local features of the restored image gradually. The experimental results show that the proposed model has an excellent dehazing effect even for small-scale hazy RSI datasets, compensating for the lack of training data for current low-level visual tasks effectively and improving the model's applicability. Compared with state-of-the-art methods, GTMNet mitigates, to some extent, color distortion on the roof of buildings with high brightness and in dense haze areas.
We found that the effectiveness of the prior GTM depends on the presence of the SOS boosted module. Therefore, the strategy for introducing external prior knowledge is crucial. In future work, inspired by the dynamic memory network (DMN+) 29, which fuses target-related external knowledge with image features, and the multi-level features fusion network (MFFN) 30, which addresses network redundancy, we will explore a self-weighted fusion strategy for auxiliary data (e.g., Synthetic Aperture Radar images, GTM) and RSI features. In addition, we will further study strategies for combining traditional methods with deep learning-based methods, and design more suitable models to avoid overfitting.

Data availability
All data generated or analyzed during this study are included in this published article.